Intro

The quality of the Nanostring samples varies considerably, depending on the quantity of RNA present, extraction quality, nCounter factors, hybridization success, etc. The Nanostring method includes internal controls that allow us to exclude some samples, but this is a fairly crude and not very stringent filter: more samples than those flagged need to be removed. Up to now, I have been using a somewhat ad-hoc approach, removing samples either on the basis of the number of genes expressed (which is biased when we are also analysing the expression of those same genes) or simply by removing outliers. Now it’s time to devise a better approach. As Michalis suggested, I will be using the “housekeeping” genes which are used to normalise the samples.

Ways of determining quality

-Via expression levels of housekeeping genes

-Via relative levels of housekeeping genes: we should expect consistent ratios

Normalisation

So we’ll stay with the old (sum-based) normalisation method
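As a point of reference, a sum-based normalisation could be sketched as below. This is a minimal Python illustration, not the code used for this report: it assumes that "sum-based" means scaling each sample so that its summed housekeeping counts match the mean housekeeping sum across samples. The function name, the data layout and the gene names (`HK1`, `HK2`, `GENE_X`) are hypothetical.

```python
def normalise_by_housekeeping_sum(counts, housekeeping):
    """Scale each sample so its housekeeping total equals the mean total.

    counts: dict of sample name -> dict of gene -> raw count (assumed layout)
    housekeeping: list of housekeeping gene names
    """
    totals = {s: sum(genes[g] for g in housekeeping) for s, genes in counts.items()}
    # Samples with no housekeeping signal cannot be scaled and are dropped here
    usable = {s: t for s, t in totals.items() if t > 0}
    target = sum(usable.values()) / len(usable)
    normalised = {}
    for s, total in usable.items():
        factor = target / total
        normalised[s] = {g: c * factor for g, c in counts[s].items()}
    return normalised


# Toy data with hypothetical gene names
toy_counts = {
    "sample_a": {"HK1": 100, "HK2": 300, "GENE_X": 50},
    "sample_b": {"HK1": 50, "HK2": 150, "GENE_X": 40},
}
toy_normalised = normalise_by_housekeeping_sum(toy_counts, ["HK1", "HK2"])
```

After scaling, every sample has the same housekeeping total, so the remaining differences between samples lie in the ratios between housekeeping genes rather than in their overall level.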

Expression levels and replicability

Need to establish an appropriate threshold of expression above which replicability is acceptable. Look at the samples for which two Nanostring replicates are available.
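One way to look at replicability as a function of expression level is to group genes by their mean count across the two replicates and summarise the disagreement within each group. The sketch below is a hypothetical Python illustration: the function name, the bin edges and the pseudocount of 1 are all assumptions rather than values taken from this analysis.

```python
import math
import statistics
from collections import defaultdict


def replicate_discordance_by_level(rep1, rep2, bin_edges=(0, 5, 10, 50, 100)):
    """Median |log2 ratio| between two replicates, grouped by mean count.

    rep1, rep2: dicts of gene -> count for the same sample (assumed layout)
    bin_edges: lower edges of the expression bins (illustrative values)
    """
    grouped = defaultdict(list)
    for gene in rep1.keys() & rep2.keys():
        a, b = rep1[gene], rep2[gene]
        mean = (a + b) / 2
        # A pseudocount of 1 avoids taking the log of zero
        diff = abs(math.log2((a + 1) / (b + 1)))
        lower_edge = max(e for e in bin_edges if e <= mean)
        grouped[lower_edge].append(diff)
    return {edge: statistics.median(v) for edge, v in sorted(grouped.items())}


# Toy replicates with hypothetical gene names
toy_rep1 = {"A": 1, "B": 200, "C": 3}
toy_rep2 = {"A": 7, "B": 200, "C": 3}
toy_discordance = replicate_discordance_by_level(toy_rep1, toy_rep2)
```

The expression level at which the median discordance drops to an acceptable value is then a candidate for the threshold.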

And looking at the residual variation:

Does this hold true amongst the Housekeeping genes?

Anywhere above 5-10 reads starts to look somewhat reasonable here.

Using simple cutoffs?

Predicting reliability from mean Housekeeping expression

Where should we choose to place our cutoffs?

Housekeeping gene expression

To place this into context, we need to ask how strongly expressed our housekeeping genes are.

Housekeeping Gene intercorrelation

Another measure of sample quality is the ratio between housekeeping gene expression levels.

PCA of Housekeeping genes

Two extreme outlier samples stand out. What if we remove them?

PCA changes drastically! Any housekeeping gene ratio method will need to remove these at least (these are strongly expressed and so won’t be cut out by other methods).

If low-quality samples don’t follow the same correlations between housekeeping gene expression levels, we can exclude those which don’t fit the pattern.

Chi-square: Observed vs expected proportional expression of Housekeeping genes

Use chi-square tests to compare the observed proportional expression of the housekeeping genes in each sample against the expected (average) proportions.

The results are highly significant, but we don’t want to be so stringent as to remove every sample with a statistically significant deviation from the average ratios; we just need to remove the worst samples.

Here, we can simply add a cut-off on the basis of this Chi-statistic. Note that our previous “weird outliers” are thrown out by this measure.
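For illustration, a per-sample chi-square statistic of this kind could be computed as in the Python sketch below. It assumes that the expected proportions are the mean of the per-sample housekeeping proportions. The function name, the data layout and the gene names are hypothetical, and this is not the code behind the results above.

```python
def housekeeping_chi_stats(counts, housekeeping):
    """Chi-square statistic per sample: observed vs expected housekeeping counts.

    counts: dict of sample name -> dict of gene -> raw count (assumed layout)
    Expected proportions are the mean per-sample proportions (an assumption).
    """
    totals, props = {}, {}
    for sample, genes in counts.items():
        total = sum(genes[g] for g in housekeeping)
        if total > 0:  # samples with no housekeeping reads are skipped
            totals[sample] = total
            props[sample] = [genes[g] / total for g in housekeeping]
    n = len(props)
    expected_prop = [sum(p[i] for p in props.values()) / n
                     for i in range(len(housekeeping))]
    stats = {}
    for sample, total in totals.items():
        stat = 0.0
        for i, gene in enumerate(housekeeping):
            expected = total * expected_prop[i]
            if expected > 0:
                stat += (counts[sample][gene] - expected) ** 2 / expected
        stats[sample] = stat
    return stats


# Toy data: sample_c has clearly different housekeeping ratios
toy_hk = {
    "sample_a": {"HK1": 50, "HK2": 50},
    "sample_b": {"HK1": 50, "HK2": 50},
    "sample_c": {"HK1": 90, "HK2": 10},
}
toy_chi = housekeeping_chi_stats(toy_hk, ["HK1", "HK2"])
```

Because the statistic is computed on counts, it grows with the total number of housekeeping reads in a sample, so a fixed cut-off is effectively stricter for strongly expressed samples.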

Jensen-Shannon Divergence as a second measure of Housekeeping gene proportions

Compute the difference between two probability distributions.
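A minimal Python sketch of the Jensen-Shannon divergence between two discrete distributions is given below, for example a sample's housekeeping proportions versus the average proportions. The function name is hypothetical, both inputs are assumed to be of equal length and already normalised to sum to 1, and base-2 logarithms are used so that the result lies between 0 and 1.

```python
import math


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two discrete distributions.

    p, q: sequences of probabilities of equal length, each summing to 1.
    """
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero probability contribute 0
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    # The mixture distribution is non-zero wherever p or q is non-zero
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


identical = jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike the chi-square statistic, this measure depends only on the proportions and not on the total read count, which is one reason the two measures can rank samples differently.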

The top few of these are also fairly poor samples that don’t fit the expected patterns.

Comparing quality measures

Proposals

I propose the following cutoffs:

-Sample unflagged by internal controls

-At least 4 housekeeping genes expressed with at least 3 reads each

-Chi-stat under 1000 for housekeeping gene ratios

This is quite stringent, cutting samples available down to:

## [1] 111

(Out of 180 samples)
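The three proposed cutoffs can be combined into a single filter, sketched below in Python. The function and argument names are hypothetical; the default thresholds are the ones proposed above.

```python
def passes_quality_control(flagged, hk_counts, chi_stat,
                           min_genes=4, min_reads=3, max_chi=1000):
    """Return True if a sample meets all three proposed cutoffs.

    flagged: True if the internal controls flagged the sample
    hk_counts: read counts of the housekeeping genes for this sample
    chi_stat: chi-square statistic for the housekeeping gene ratios
    """
    if flagged:
        return False
    # Count housekeeping genes expressed with at least min_reads reads
    expressed = sum(1 for count in hk_counts if count >= min_reads)
    return expressed >= min_genes and chi_stat < max_chi


keep = passes_quality_control(False, [10, 5, 3, 3, 0], 200)
```

Keeping the thresholds as arguments makes it straightforward to relax them later if the filter turns out to be too stringent.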

Building a random forest model with stringent quality control

Not bad, but needs to be optimised further. This may be overly stringent and have cut the training dataset down too much.